# Let's avoid messages and warnings in SA_Amazon_Insights&Results.html. Anyway, messages and warnings produced by the code have already been dealt with.  
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)

# The next opts_chunk centers figures.
knitr::opts_chunk$set(out.width = "100%", fig.align = "center")

# The next instruction facilitates table layout in HTML.
options(knitr.table.format = "html")

# I use the string <br> to generate empty lines.

*
*       *



1 Executive Summary

88 % prediction accuracy has been reached on the validation set, against 50 % with a baseline model. Data is an Amazon sample provided in UCI Machine Learning Repository.

In this sentiment analysis project, which factors have contributed towards that improvement with 38 percentage points?

Natural Language Processing has contributed 21.7 percentage points: corpus, lowercasing, punctuation handling, stopword removal, stemming, tokenization from sentences into words, bag of words.

Text mining has brought additional accuracy improvement with 12.7 percentage points. The following insights have been determinant.

In decision trees predominate some tokens conveying subjective information; but other tokens containing subjective information have not been used in false negatives and false positives. Such ignored subjective information has been retrieved from random samples of false negatives and false positives, exclusively on the training set; customized lists have been established with tokens sorted as having either positive or negative sentiment orientation; occurrences of these tokens in reviews have been replaced with either a positive or a negative generic token. Polarization and text substitution have brought 10.3 percentage points out of the 12.7.

Another insight has been about negation impact: negation has been fruitfully integrated, contributing 2.4 percentage points towards the 12.7 improvement from text mining.

Machine learning optimization has been performed across 10 models. Testing has been conducted on accuracy distributions across bootstrapped resamples. eXtreme Gradient Boosting has emerged as the most performing model in this project and has boosted accuracy with 3.6 additional percentage points.


TAGS: sentiment analysis, natural language processing, text mining, subjective information, tokenization, bag of words, word frequency, interactive wordclouds, graphs, and tables, decision trees, false negatives, false positives, text classification, polarization, lists of positive n-grams, lists of negative n-grams, text substitution, machine learning, binary classification, eXtreme Gradient Boosting, Monotone Multi-Layer Perceptron Neural Network, Random Forest, Stochastic Gradient Boosting, Support Vector Machines with Radial Basis Function Kernel, AdaBoost Classification Trees, bootstrapping, accuracy distributions across resamples, R


GITHUB: https://github.com/Dev-P-L/Sentiment-Analysis


2 Foreword

Dear Readers, you are most welcome to run the project on your own computer if you so wish.

This project is lodged with the GitHub repository https://github.com/Dev-P-L/Sentiment-Analysis.

It is comprised of twelve files. All code is included in SA_Amazon_Code.Rmd. It does not show in the result report, called SA_Amazon_Insights&Results.html.

For your convenience, the dataset has already been downloaded onto the GitHub repository wherefrom it will be automatically retrieved by the code from SA_Amazon_Code.Rmd. If you so wish, you can also easily retrieve the dataset from https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences and adapt the SA_Amazon_Code.Rmd code accordingly.

You can knit SA_Amazon_Code.Rmd (please in HTML) and produce SA_Amazon_Insights&Results.html on your own computer. Before knitting SA_Amazon_Code.Rmd (please in HTML) on your computer, don’t forget to copy the file styles.css from the GitHub repository into the same folder as SA_Amazon_Code.Rmd.

On my laptop, running SA_Amazon_Code.Rmd takes approximately four hours. For information about my work environment, see the session info at the end of this document.

Some packages are required in SA_Amazon_Code.Rmd. The code from SA_Amazon_Code.Rmd contains instructions to download these packages if they are not available yet.

# I. CLEANING USER INTERFACE FOR RAM MANAGEMENT.

# a. Clearing plots
invisible(if(!is.null(dev.list())) dev.off())

# b. Cleaning workspace
rm(list=ls())

# c. Cleaning console
cat("\014")
########################################################################

# II. PACKAGES.

# a. Installing packages if necessary.

if(!require(tidyverse)) install.packages("tidyverse", repos = "http://cran.us.r-project.org")
if(!require(tm)) install.packages("tm", repos = "http://cran.us.r-project.org")
if(!require(SnowballC)) install.packages("SnowballC", repos = "http://cran.us.r-project.org")
if(!require(e1071)) install.packages("e1071", repos = "http://cran.us.r-project.org")
if(!require(wordcloud2)) install.packages("wordcloud2", repos = "http://cran.us.r-project.org")
if(!require(RColorBrewer)) install.packages("RColorBrewer", repos = "http://cran.us.r-project.org")
if(!require(caTools)) install.packages("caTools", repos = "http://cran.us.r-project.org")
if(!require(rpart)) install.packages("rpart", repos = "http://cran.us.r-project.org")
if(!require(rpart.plot)) install.packages("rpart.plot", repos = "http://cran.us.r-project.org")
if(!require(caret)) install.packages("caret", repos = "http://cran.us.r-project.org")
if(!require(kernlab)) install.packages("kernlab", repos = "http://cran.us.r-project.org")
if(!require(fastAdaboost)) install.packages("fastAdaboost", repos = "http://cran.us.r-project.org")
if(!require(randomForest)) install.packages("randomForest", repos = "http://cran.us.r-project.org")
if(!require(gbm)) install.packages("gbm", repos = "http://cran.us.r-project.org")
if(!require(xgboost)) install.packages("xgboost", repos = "http://cran.us.r-project.org")
if(!require(monmlp)) install.packages("monmlp", repos = "http://cran.us.r-project.org")
if(!require(kableExtra)) install.packages("kableExtra", repos = "http://cran.us.r-project.org")
if(!require(gridExtra)) install.packages("gridExtra", repos = "http://cran.us.r-project.org")
if(!require(utf8)) install.packages("utf8", repos = "http://cran.us.r-project.org")
if(!require(devtools)) install.packages("devtools", repos = "http://cran.us.r-project.org")
if(!require(plotly)) install.packages("plotly", repos = "http://cran.us.r-project.org")
if(!require(htmltools)) install.packages("htmltools", repos = "http://cran.us.r-project.org")
if(!require(DT)) install.packages("DT", repos = "http://cran.us.r-project.org")
if(!require(utils)) install.packages("utils", repos = "http://cran.us.r-project.org")

# b. Requiring libraries.

library(tidyverse)
library(tm)
library(SnowballC)
library(e1071)
library(wordcloud2)
library(RColorBrewer)
library(caTools)
library(rpart)
library(rpart.plot)
library(caret)
library(kernlab)
library(fastAdaboost)
library(randomForest)
library(gbm)
library(xgboost)
library(monmlp)
library(kableExtra)
library(gridExtra)
library(utf8)
library(devtools)
library(plotly)
library(htmltools)
library(DT)
library(utils)

# c. Preventing silently failing after the first wordcloud2.

# See https://github.com/Lchiffon/wordcloud2/issues/65 .
devtools::install_github("gaospecial/wordcloud2")

########################################################################

# III. COLOR PALETTE

dark_cerulean <- "#08457E"
dodger_blue <- "#0181ff"
greenish_blue <- "#507786"
light_gray <- "#808080"
super_light_gray <- "#a7a7a7"
harvard_crimson <- "#a41034"
light_taupe <- "#b38b6d"
super_light_taupe <- "#d6c0b0"
paris_green <- "#50C878"

# For other hues, preexisting denominations will be used such as "powderblue", "mistyrose", etc. 

Now, let’s turn to data.


3 Data

As explained on the UCI Machine Learning Repository website, data is organized in a CSV file in two columns. In the first column, there are 1,000 Amazon product reviews (sentences). In the second column, there is a positive or negative evaluation; the ratio of positive evaluations is 50 %.

That file will be split into training reviews - two thirds of reviews - and validation reviews. Let’s have a quick look at the number of positive and negative reviews in the training set.


# Downloading data.

myfile <- "https://raw.githubusercontent.com/Dev-P-L/Sentiment-Analysis--Amazon-Review--Revisited/master/amazon_cells_labelled.txt"
reviews <- read.delim(myfile, header = FALSE, sep = "\t", quote = "", 
                      stringsAsFactors = FALSE)
rm(myfile)

reviews <- reviews %>% 
  `colnames<-`(c("text", "sentiment")) %>%
      # Replacing numerical variable "sentiment" (0/1 values)
      # with factor variable "sentiment" (Neg/Pos values).
  mutate(sentiment = as.factor(gsub("1", " Pos", 
         gsub("0", "Neg", sentiment)))) %>% as.data.frame()
      # The leading white space character in " Pos" 
      # cares for " Pos" coming first in the confusion matrix
      # so that a "true positive" (review that is predicted positive
      # and is actually positive) corresponds to positive review polarity.

# Creating training index and validation index.

set.seed(1)
ind_train <- createDataPartition(y = reviews$sentiment, 
                                 times = 1, p = 2/3, list = FALSE)
ind_val <- as.integer(setdiff(1:nrow(reviews), ind_train))

# ind_train allows to select the reviews that will be used for training, 
# be it in NLP, in text mining or in ML.

# Building up the training set with the training index. 

reviews_training <- reviews[ind_train, ] %>% 
  as.data.frame() %>% 
  `rownames<-`(1:nrow(.)) %>% 
  mutate(ro = rownames(.)) %>%
  select(ro, everything())

# Some simple statistics in a table: numbers of positives reviews and of negative ones. 

tab <- table(reviews_training$sentiment) %>%
  as.data.frame() %>%
  `colnames<-`(c("Review Polarity", 
                 "Number of Reviews in Training Set"))

knitr::kable(tab, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, width = "2.5in", bold = T) %>%
  column_spec(2, width = "3in", bold = T) %>%
  row_spec(1, color = "white", background = greenish_blue) %>%
  row_spec(2, color = "white", background = harvard_crimson)
Review Polarity Number of Reviews in Training Set
Pos 334
Neg 334
rm(tab)


Let’s have a look at training reviews.


# Building up data frame.

tab <- reviews_training %>% 
  `colnames<-`(c("Row Number", "Training Review", "Sentiment"))

# Building up interactive presentation table.

datatable(tab, rownames = FALSE, filter = "top", 
          options = list(pageLength = 10, scrollX = T,
                         
          # Setting background color and font color in header.               
                         
            initComplete = JS(
              'function(settings, json) {',
              '$(this.api().table().header()).css({
                  "background-color": "#507786", 
                  "color": "white"});', 
              '}'),
            
            # Setting background color in rows. 
            
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "powderblue";','}',
              '}')
            )
          )
rm(tab)


In order to better catch the relationship between the reviews and the reviews sentiment polarity, let’s proceed to some Natural Language Processing. The idea is to detect words and expressions that impact sentiment polarity.


4 NLP

We have seen before that 50 % of reviews have positive sentiment polarity; of course, also 50 % of reviews have negative sentiment polarity.

Consequently, we cannot apply the base model in prediction. Indeed, considering that all reviews have e.g. positive polarity would deliver 50 % true positives and 50 % false positives.

We do need additional information to predict. We are going to retrieve that information from words. So, let’s identify words.

To do so, we are going

  • to create a corpus of the words from training reviews;
  • to process these words in NLP through lowercasing, punctuation removal, stopwords removal, stemming;
  • to produce a bag of words or document term data frame;
  • to check up NLP output;
  • and to measure NLP impact on review polarity prediction.


4.1 Bag of Words

Training reviews will be transposed into a corpus. Then the corpus will be processed in NLP: words will be lowercased, punctuation marks will be removed as well as stopwords and finally words will be stemmed.

Tokenization will then take place, a bag of words being created. The bag of words takes the form of a Document Term Matrix: the 668 rows correspond to the 668 training reviews; there is a column for each token. At the junction of each row and each column, there is a frequency number representing the occurrence of the corresponding token in the corresponding review.

Applying a sparsity threshold of .995 will only leave tokens that appear in at least 0.5 % of reviews.

As a pre-attentive insight, a wordcloud will show the most frequent tokens. The wordcloud is interactive: just hover over a token and you get the frequency of occurrence.

# Corpus is created on training reviews only to avoid any interference between training reviews and validation reviews. Otherwise, tokens from validation set could (slightly) impact token selection when applying the sparsity threshold. 

corpus <- VCorpus(VectorSource(reviews_training$text)) 

# Lowercasing, removing punctuation and stopwords, stemming document.

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stemDocument)

# Building up a bag of words in a Document Term Matrix.

dtm <- DocumentTermMatrix(corpus)

# Managing sparsity with sparsity threshold. 

sparse <- removeSparseTerms(dtm, 0.995)

# Converting sparse, which is a DocumentTermMatrix, to a matrix and then to a data frame.

sentSparse <- as.data.frame(as.matrix(sparse)) 

# Making all column names R-friendly.
colnames(sentSparse) <- make.names(colnames(sentSparse))

# In order to get some pre-emptive insights into the bag of words, let's use a wordcloud. 

# First, let's build up a data frame with only the 40 most frequent tokens from "sentSparse", i.e. the Document Term Matrix pruned by the sparsity process. 

df <- data.frame(word = colnames(sentSparse), 
                 freq = colSums(sentSparse)) %>%
  filter(freq >= 10) %>%
  arrange(desc(freq)) %>%
  head(., 40)

# Second, let's create the wordcloud. Numerous colors are used to easily dissociate tokens.

set.seed(1)
wordcloud2(df, shape = 'square', color = 'random-dark',
           backgroundColor = super_light_taupe, shuffle = FALSE)

There are topic-related tokens such as “phone”, tokens conveying subjective information such as “great”, etc. Before analyzing token categories, let’s check up the technical adequacy of results from the NLP process.


4.2 Checking

The wordcloud above is an ergonomic tool to easily pinpoint some NLP flaws.


4.2.1 Short Forms

Some tokens were not expected, such as “dont” or “ive”, since they seem to originate in short forms and were expected to have been eliminated as stopwords.

Let’s start investigating with “dont”. The frequency of occurrence is at least 10 since that is a prerequisite to enter the wordcloud. But there can be more instances.


# In the training reviews, which rows contain a digit at least equal to 1 in the column "dont"? 

bin <- which(sentSparse$dont >= 1)

# Building up a small presentation table.

df <- data.frame(length(bin)) %>% 
  `colnames<-`('Number of Reviews Containing "dont"') %>%
  `rownames<-`("Bag of Words from Training Reviews")

knitr::kable(df, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = greenish_blue) %>%
  column_spec(2, bold = T, color = "white", background = harvard_crimson) 
Number of Reviews Containing “dont”
Bag of Words from Training Reviews 20
rm(df)

# Keeping bin for later use.


Perusing the bag of words for rows containing “dont” has led to distinguishing two scenarios. The first one is an exception, but it can be generalized to other tokens. Here it is.


df <- data.frame(reviews_training$ro[bin[17]], 
                 reviews_training$text[bin[17]]) %>%
  `colnames<-`(c("Training Review Number",
               '"dont" Originating in Misspelling')) 

knitr::kable(df, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  row_spec(1, bold = T, color = "white", background = harvard_crimson) 
Training Review Number “dont” Originating in Misspelling
544 dont buy it.
rm(df)


“dont” contains a spelling error or is, in a more inclusive wording, “alternative” grammar: it has been used instead of “don’t”. Actually, there is only one such case in the bag of words. But it could happen more often and also with other short forms such as “couldn’t”, “isn’t”, … becoming “couldnt”, “isnt”, …

We are going to treat these misspelled short forms as if they were standardly written. We will complement stopwords with variants such as “dont”, “couldnt”. Consequently, when we remove stopwords, the misspelled short forms can be eradicated as well as the standardly written short forms, at least for the mispelled short forms we can think of… Complementing stopwords with misspelled short forms will be done in the next section “Fine Tuning NLP”.

Now, let’s have a look at the most common scenario that has generates “dont”. Let’s just show the one review with two occurrences.


# Localizing the cases, i.e. all cases except the one in the first scenario above, which originated in a misspelling. 
bin_2 <- bin[-17]

# Building up data frame. 
tab <- reviews_training[bin_2, ]

tab <- tab %>%
  `colnames<-`(c("Training Review Number",
                 "\"dont\" Originating in \"don't\"",
                 "Sentiment")) 

# Building up interactive presentation table.

datatable(tab, rownames = FALSE, filter = "top", 
          
          options = list(pageLength = 10, scrollX = T,
                         
          # Setting background color and font color in header.               
                         
            initComplete = JS(
              'function(settings, json) {',
              '$(this.api().table().header()).css({
                  "background-color": "#A41034", 
                  "color": "white"});', 
              '}'),
            
            # Setting background color in rows. 
            
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )


This is the general scenario: “don’t” has been standardly written and it was expected to have disappeared as all stopwords and nevertheless it is still in the bag of words since we have seen it in the bag of words wordcloud.

What happened? Before stopword removal, all punctuation marks have been removed and consequently “don’t” has become “dont”; it is no longer identical to the stopword “don’t” and, very logically, it has not been removed.

This scenario happened in 19 reviews and, without change, it would happen for all short forms that include an apostrophe.

In order to prevent that scenario from happening, there are simple solutions, e.g.:

  • discarding stopwords, and consequently short forms, before removing punctuation;
  • or, removing punctuation marks with the exception of apostrophes, discarding stopwords, and consequently short forms, and only then removing the remaining apostrophes (apostrophes present at other places than in short forms).

An appropriate solution will be applied in the next section “Fine Tuning NLP”.

Now, it is time we switched to another NLP flaw that is perceptible in the bag of words wordcloud above: words collapse.


4.2.2 Words Collapse

Let’s have a look at the whole bag of words (obtained before applying the sparsity process).

# Collecting all tokens, upstream of the sparsity process, which the token "brokeni" couldn't pass since there is only one instance of "brokeni"!

tokens <- findFreqTerms(dtm, lowfreq = 1) %>%
  as.data.frame() %>%
  `colnames<-`("Every Token from the Whole Bag of Words")

# Instead of "findFreqTerms(dtm, lowfreq = 1)" 
# we could also have used "colnames(dtm)" ...

# Building up interactive presentation table.

datatable(tokens, rownames = FALSE, filter = "top", 
          
          options = list(width = "450px", pageLength = 10, scrollX = F,
            
            # Centers the single datatable column (column 0).
            
            columnDefs = list(list(className = 'dt-center', targets = 0)),
            
            # Sets background color and font color in header.   
            
            initComplete = JS(
              'function(settings, json) {',
              '$(this.api().table().header()).css({
                  "background-color": "#A41034", 
                  "color": "white"});', 
              '}'),
            
            # Sets background color in rows. 
            
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )


First, there are several numbers. Numbers will be removed.

Second, some unigrams seem to originate from two words:

  • “abovepretti”,
  • “brokeni”,
  • “buyit”,
  • “replaceeasi”,
  • “unacceptableunless”,
  • etc.

Let’s check whether e.g. “brokeni” originates in words collapse.


# We have to work on all tokens, upstream of the sparsity process, which the token "brokeni" couldn't pass since there is only one instance of "brokeni"! The corpus meets this requirement: it contains all tokens. Let's extract the row number(s) generating "brokeni".

v <- 1:length(corpus)
for(i in v) {
  v[i] <- length(grep("brokeni", corpus[[i]]$content))
}

# Second, retrieving the corresponding review. 
df <- data.frame(
  reviews_training$ro[which(v >= 1)],
  reviews_training$text[which(v >= 1)], 
  stringsAsFactors = FALSE) %>%
  `colnames<-`(c("Review Row Number", 
                 'Review Producing "brokeni"'))

knitr::kable(df, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, width = "2in") %>%
  row_spec(1, bold = T, color = "white", background = harvard_crimson) 
Review Row Number Review Producing “brokeni”
381 I got the car charger and not even after a week the charger was broken…I went to plug it in and it started smoking.
rm(v, i, df)


What happened? Well, “broken…I” was first lowercased to “broken…i”, then punctuation was removed by the function removePunctuation(), which does not insert any white space character, and “broken…i” has become “brokeni”.

This has to be corrected of course for “brokeni” but also for similar cases. In the next section “Fine Tuning NLP”, a general solution will be applied.


4.3 Fine Tuning

Instead of using the function removePunctuation() from the package tm, specific “for loops” will be developed, preprocessing reviews according to the needs stated above and in a stepwise way:

  • punctuation marks other than apostrophes will be replaced with white space characters instead of just being removed;
  • short forms will be removed;
  • remaining apostrophes will be replaced with white space characters;
  • other stopwords will be removed (it is done in step 4 and not in step 2 in order to do it when absolutely all punctuation marks have been removed: please see example with “brokeni” where two words and one punctuation mark are stuck together…).

Among stopwords, short forms (contractions) need to be specifically treated. Additional needs of breakdown might also emerge. Starting from the stopword list delivered by the function stopwords(“english”) from the package tm, four CSV files will be produced.

These are the four files:

  • short_forms_pos.csv, with all positive short forms from stopwords(“english”) such as “she’s”, a few additional ones and numerous misspelled variants such as “she s” or “shes”;
  • short_forms_neg.csv, in the same approach, for short forms such as “isn’t”, “daren’t” but also “isn t”, “isnt”, etc.;
  • negation.csv, with seven negational unigrams such as “not” or “no”;
  • stopwords_remaining.csv, which is self-explanatory.

The 4 files have been uploaded to the GitHub repository https://github.com/Dev-P-L/Sentiment-Analysis. They are going to be downloaded now and integrated into NLP pre-processing.

Let’s rebuild the corpus, the bag of words and the interactive wordcloud (just hover over tokens to get the frequency of occurrence).

# Downloading the 4 files described above and preparing them in order to rebuild the corpus, the bag of words (the Document Term  Matrix) and the wordcloud. 

myfile <- "https://raw.githubusercontent.com/Dev-P-L/Sentiment-Analysis/master/short_forms_pos.csv"
short_forms_pos <- read.csv(myfile, header = FALSE, 
                            stringsAsFactors = FALSE)
short_forms_pos <- short_forms_pos[, 2] %>% as.vector()

# Normalizing (among others, apostrophes). 
short_forms_pos <- sapply(short_forms_pos, utf8_normalize, 
                          map_quote = TRUE)

myfile <- "https://raw.githubusercontent.com/Dev-P-L/Sentiment-Analysis/master/short_forms_neg.csv"
short_forms_neg <- read.csv(myfile, header = FALSE, 
                            stringsAsFactors = FALSE)
short_forms_neg <- short_forms_neg[, 2] %>% as.vector()

# Normalizing (among others, apostrophes). 
short_forms_neg <- sapply(short_forms_neg, utf8_normalize, 
                          map_quote = TRUE)

myfile <- "https://raw.githubusercontent.com/Dev-P-L/Sentiment-Analysis/master/negation.csv"
negation <- read.csv(myfile, header = FALSE, 
                     stringsAsFactors = FALSE) 
negation <- negation[, 2] %>% as.vector()

myfile <- "https://raw.githubusercontent.com/Dev-P-L/Sentiment-Analysis/master/stopwords_remaining.csv"
stopwords_remaining <- read.csv(myfile, header = FALSE, 
                                stringsAsFactors = FALSE) 
stopwords_remaining <- stopwords_remaining[, 2] %>% 
  as.vector()

rm(myfile)

# Creating and preprocessing corpus again.

corpus_av0 <- VCorpus(VectorSource(reviews_training$text)) 
corpus_av0 <- tm_map(corpus_av0, content_transformer(tolower))

# Replacing all punctuation marks other than apostrophes with white space 
# characters, instead of simply suppressing punctuation marks, not to risk 
# collapsing two or more words into one. 
# But keeping apostrophes to leave intact short forms such as "don't" 
# and be able to identify them as short forms 
# and as such to discard them. 

for (i in 1:nrow(reviews_training)) {
  corpus_av0[[i]]$content <- gsub("(?!')[[:punct:]]", " ", 
                              corpus_av0[[i]]$content, perl = TRUE)
}
rm(i)

# Removing extra white space characters 
# (= removing all white space characters except one in a sequence).
# Then removing short forms.

corpus_av0 <- tm_map(corpus_av0, stripWhitespace)
corpus_av0 <- tm_map(corpus_av0, removeWords, short_forms_neg)
corpus_av0 <- tm_map(corpus_av0, removeWords, short_forms_pos)

# Replacing all remaining apostrophes with white space characters 
# (there might be other apostrophes than in short forms...). 

for (i in 1:nrow(reviews_training)) {
  corpus_av0[[i]]$content <- gsub("[[:punct:]]", " ",                                                          corpus_av0[[i]]$content)
}
rm(i)

# Removing n-grams from other files. 

corpus_av0 <- tm_map(corpus_av0, removeWords, negation)
corpus_av0 <- tm_map(corpus_av0, removeWords, stopwords_remaining)

# Stemming words.

corpus_av0 <- tm_map(corpus_av0, stemDocument)

# Removing numbers and extra white space characters.

corpus_av0 <- tm_map(corpus_av0, removeNumbers)
corpus_av0 <- tm_map(corpus_av0, stripWhitespace)

# Building up a bag of words in a Document Term Matrix.

dtm_av0 <- DocumentTermMatrix(corpus_av0)

# Managing sparsity with the sparsity threshold. 

sparse_av0 <- removeSparseTerms(dtm_av0, 0.995)

# Converting sparse_av0, which is a DocumentTermMatrix, 
# to a matrix and then to a data frame.

sentSparse_av0 <- as.data.frame(as.matrix(sparse_av0)) 

# Making all column names R-friendly.

colnames(sentSparse_av0) <- make.names(colnames(sentSparse_av0))

# Let's check whether shortcomings have disappeared or not
# by building up a wordcloud with the most frequent tokens 
# originating from the training reviews.
# Keeping only the 50 most frequent tokens. 

df <- data.frame(word = colnames(sentSparse_av0), 
                 freq = colSums(sentSparse_av0)) %>%
  filter(freq >= 10) %>%
  arrange(desc(freq)) %>%
  head(., 40)

# Building up wordcloud. 

set.seed(1)
wordcloud2(df, shape = 'square', color = 'random-light',
           backgroundColor = greenish_blue, shuffle = FALSE)

In the wordcloud, there is no more token originating from short forms.

Let’s have a broader look, building up a presentation table and checking whether all abovementioned oddities have disappeared. Let’s check up in the bag of words whether “dont” has indeed disappeared.


# Retrieving all tokens, upstream of the sparsity process. 
tokens <- findFreqTerms(dtm_av0, lowfreq = 1)

# Choosing the number of columns of the presentation table. 
nc <- 5

# Calculating the number of missing tokens to have a full matrix. 
mis <- ((ceiling(length(tokens) / nc)) * nc) - length(tokens)

# Building up a table.
tokens <- as.character(c(tokens, (rep("-", mis))))
tokens <- data.frame(matrix(tokens, ncol = nc, byrow = TRUE)) %>%
  `colnames<-`(NULL) %>% `rownames<-`(NULL)

# Looking for "dont".
knitr::kable(tokens[51, ], "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = greenish_blue) %>%
  column_spec(2:6, bold = T, color = "white", background = greenish_blue)
51 dit dock done doubl download
rm(nc, mis)


Yes, indeed, “dont” has disappeared. Let’s check up in the same way for “ive”!


knitr::kable(tokens[96, ], "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = greenish_blue) %>%
  column_spec(2:6, bold = T, color = "white", background = greenish_blue)
96 issu item jabra jawbon jerk


“ive” has also disappeared. Now “brokeni”.


knitr::kable(tokens[21, ], "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = greenish_blue) %>%
  column_spec(2:6, bold = T, color = "white", background = greenish_blue)
21 breakag brilliant broke broken brows


“brokeni” has vanished as well, just as many other oddities.

The next interactive datatable allows to check up for the disappearance of some other oddities.


# Collecting all tokens, upstream of the sparsity process, which the token "brokeni" couldn't pass since there is only one instance of "brokeni"!

tokens <- findFreqTerms(dtm_av0, lowfreq = 1) %>%
  as.data.frame() %>%
  `colnames<-`("Every Token after NLP (but before Sparsity Process)")

# Instead of "findFreqTerms(dtm, lowfreq = 1)" 
# we could also have used "colnames(dtm)" ...

# Building up interactive presentation table.

datatable(tokens, rownames = FALSE, filter = "top", 
          
          options = list(width = "450px",
            
            pageLength = 10, scrollX = F,
            
            # Centers the single datatable column (column 0).
            columnDefs = list(list(className = 'dt-center', targets = 0)),
            
            # Sets background color and font color in header.        
            initComplete = JS(
              'function(settings, json) {',
              '$(this.api().table().header()).css({
                  "background-color": "#A41034", 
                  "color": "white"});', 
              '}'),
            
            # Sets background color in rows. 
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )


This interactive datatable allows us to search for other previously pinpointed oddities and to realize that they have indeed disappeared.

By entering tokens in the search box, we can once again easily check that “dont” and “ive” have indeed disappeared.

All short forms have also vanished from the bag of words.

The same holds for “abovepretti”, “replaceeasi” or “unacceptableunless”, which looked like the result from words collapse.

On the contrary, “buyit” has not vanished, because at least once it was written in that way in a review.This can easily be checked up by entering “buyit” in the interactive table above with “Training Review” in the header (interactive table on blue background color).

Numbers have disappeared.

I leave uncorrected some spelling errors, such as “disapoint” or “dissapoint”, because this is no repetitive structure and occurrence seems marginal.

After cleaning the bag of words through NLP, let’s have a first try at predicting sentiment by using tokens as predictors.


5 Predicting after NLP

NLP impact will be computed as the gain in accuracy provided by a standard machine learning model in comparison with the baseline model.

The baseline model accuracy would be 0.50 on the training reviews, since each class (positive sentiment polarity or negative sentiment polarity) is 50 % of the training reviews as already shown.

The chosen machine learning model will be CART: it runs rather quickly and delivers clear decision trees. Running function rpart() on the training set delivers the accuracy level mentioned hereunder.


# Adding dependent variable.

sentSparse_av0 <- sentSparse_av0 %>% 
  mutate(sentiment = reviews_training$sentiment)

# Training CART with the algorithm rpart.

set.seed(1)
fit_cart_av0 <- rpart(sentiment ~., data = sentSparse_av0)
fitted_cart_av0 <- predict(fit_cart_av0, type = "class")
cm_cart_av0 <- confusionMatrix(fitted_cart_av0, sentSparse_av0$sentiment)

# Accuracy level 

df <- data.frame(round(cm_cart_av0$overall["Accuracy"], 4)) %>%
  `rownames<-`("Model: CART") %>% 
  `colnames<-`("Accuracy on the Training Set")

knitr::kable(df, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = greenish_blue) %>%
  column_spec(2, bold = T, color = "white", background = greenish_blue)
Accuracy on the Training Set
Model: CART 0.768
rm(df)


Now let’s train the rpart method with the train() function from the package caret.

By default, the train() function would train across 3 values of cp (the complexity parameter) and 25 bootstrapped resamples for each tuned value of cp. As far as the number of tuned values is concerned, let’s upgrade it to 15 to increase the odds of improving accuracy, especially as rpart runs rather quickly.

The default resampling method is bootstrapping, samples being built with replacement, some reviews being picked up twice or more and some other reviews not being selected. This method seems especially appropriate here because the size of each resample will be the same of the size of the training set, which is already limited, i.e. 668. Working with e.g. K-fold cross-validation would imply further splitting the training set.

Will accuracy improve?


# Running rpart on the training set. 

set.seed(1)
fit_cart_tuned_av0 <- train(sentiment ~ .,
                         method = "rpart",
                         data = sentSparse_av0,
                         tuneLength = 15,
                         metric = "Accuracy")
fitted_cart_tuned_av0 <- predict(fit_cart_tuned_av0)
cm_cart_tuned_av0 <- confusionMatrix(as.factor(fitted_cart_tuned_av0), 
                                     as.factor(sentSparse_av0$sentiment))

# The tuned rpart model delivers an accuracy level of

df <- data.frame(round(cm_cart_tuned_av0$overall["Accuracy"], 4)) %>%
  `rownames<-`("Model: CART + cp Tuning") %>% 
  `colnames<-`("Accuracy on the Training Set")

knitr::kable(df, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = greenish_blue) %>%
  column_spec(2, bold = T, color = "white", background = greenish_blue)
Accuracy on the Training Set
Model: CART + cp Tuning 0.7814
rm(df)


Accuracy increases from 76.5 to 78.3. For the record, let’s have a look at a graph showing how accuracy evolves across the 15 cp values chosen by the train() function.


graph <-  
  ggplot(fit_cart_tuned_av0) + 
  geom_line(col = greenish_blue, size = 1) +
  geom_point(col = harvard_crimson, size = 4) +
  ggtitle("Average Bootstrap Accuracy across cp Values") +
  xlab("Complexity Parameter") + ylab("Average Accuracy (Bootstrap)") +
  theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
        axis.title.x = element_text(size = 16), 
        axis.title.y = element_text(size = 16), 
        axis.text.x = element_text(size = 12), 
        axis.text.y = element_text(size = 12))

p <- ggplotly(graph, dynamicTicks = TRUE, width = 800, height = 500 )

# Centering the graph, because the centering opts_chunk previously inserted is not operative in the case of the ggplotly() function. 

htmltools::div(p, align = "center" )
rm(graph)


The optimal value of cp is zero. This means that the train() function has kept the decision tree as complex as possible by assigning a zero value to the complexity parameter.

On the graph above, maximum accuracy is a bit lower than the level previously indicated. Why is it different? Because, on the graph, it is, for each cp value, the average accuracy on the 25 bootstrapped resamples, while accuracy previously given related to the whole training set.

On the whole training set, the rpart model without tuning delivers approximately 76.8 % accuracy and the rpart model with tuning 78.1 %. Both levels are substantially higher than accuracy provided by the baseline model.

The baseline model would predict a positive evaluation for all reviews (or alternatively a negative evaluation for all reviews) since prevalence is 50 %. Prevalence should show in the accuracy level delivered by the baseline model on the training set. Let’s check it up.


# Document Term Matrix from training reviews, after Sparsity Process

df <- sentSparse_av0

# Data frame with 2 columns, one with positive sentiment polarity everywhere (baseline model) and one column with actual sentiment polarity

pred_baseline <- 
  data.frame(sentiment = rep(" Pos", nrow(df))) %>%
  mutate(sentiment = factor(sentiment, levels = levels(df$sentiment)))

# Confusion matrix

cm_baseline <- confusionMatrix(pred_baseline$sentiment, 
                               as.factor(df$sentiment)) 

# Presentation table of baseline model accuracy

df <- data.frame(sprintf("%.4f", 
                  round(cm_baseline$overall["Accuracy"], 4))) %>%
      `colnames<-`("Accuracy on the Training Set") %>%
      `rownames<-`("Model: Baseline")

# Layout

knitr::kable(df, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = greenish_blue) %>%
  column_spec(2, bold = T, color = "white", background = greenish_blue)
Accuracy on the Training Set
Model: Baseline 0.5000
rm(df)


Let’s summarize results from the three models, not only with accuracy but also with additional performance metrics.



In the table above, on row 1, fonts have been stricken through to indicate that this model is discarded because if delivers only 50 % accuracy and looks like a dead-end path.

The other two models should be seen as a cumulative process bringing accuracy improvement in a stepwise and incremental way, with the one on green background being the best in accuracy. Models 2 and 3 deliver higher accuracy but also asymmetry between other performance metrics: sensitivity and negative predictive value are lower than specificity and positive predictive value. This reflects false negatives being more numerous than false positives. False negatives are predictions pointing to “Neg” while the reference value is " Pos". This is an insight for text mining: perusing false negatives and coming with actionable findings.

In order to confirm that false negatives are more numerous than false positives, let’s have a look at the confusion matrix for both models. First, the confusion matrix from the rpart model without tuning.


# Metric abbreviations in confusion matrices
name <- c("TP = ", "FN = ", "FP = ", "TN = ")

# Building up confusion matrix data in vector format.
tab <- table(fitted_cart_av0, sentSparse_av0$sentiment) %>% 
  as.vector() %>% paste(name, ., sep = "")

# Ordering data in confusion matrix format and inserting headers in the confusion matrix. 
tab <- data.frame(matrix(tab, ncol = 2, nrow = 2, byrow = FALSE)) %>%
  `colnames<-`(c("Actually positive", "Actually negative")) %>%
  `rownames<-`(c("Predicted positive with CART", 
                 "Predicted negative with CART"))

# Layout
knitr::kable(tab, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = "black") %>%
  column_spec(2, bold = T, color = "white", background = greenish_blue) %>%
  column_spec(3, bold = T, color = "white", background = harvard_crimson)
Actually positive Actually negative
Predicted positive with CART TP = 214 FP = 35
Predicted negative with CART FN = 120 TN = 299
rm(tab)


The weak point lies in the first column, on greenish blue background: the relatively high number of false negatives and, as a corollary, the relatively low number of true positives. On the reference positive class (" Pos" in label), predicting seems problematic or at the very least challenging since false negatives are rife. On the contrary, on the reference negative class (“Neg” in label), predicting has run smoothly, with a satisfactorily low number of false positives.

The tuned rpart model is expected to slightly reduce the excess in false negatives.


# Metric abbreviations in confusion matrices
name <- c("TP = ", "FN = ", "FP = ", "TN = ")

# Building up confusion matrix data in vector format.
tab <- table(fitted_cart_tuned_av0, sentSparse_av0$sentiment) %>% 
  as.vector() %>% 
  paste(name, ., sep = "")

# Ordering data in confusion matrix format and inserting headers in the confusion matrix.
tab <- data.frame(matrix(tab, ncol = 2, nrow = 2, byrow = FALSE)) %>%
  `colnames<-`(c("Actually positive", "Actually negative")) %>%
  `rownames<-`(c("Predicted positive with CART + tuning", 
                 "Predicted negative with CART + tuning"))

# Layout
knitr::kable(tab, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = "#333333") %>%
  column_spec(2, bold = T, color = "white", background = greenish_blue) %>%
  column_spec(3, bold = T, color = "white", background = harvard_crimson)
Actually positive Actually negative
Predicted positive with CART + tuning TP = 236 FP = 48
Predicted negative with CART + tuning FN = 98 TN = 286
rm(tab, tokens, name)
rm(cm_cart_av0, fit_cart_av0, fitted_cart_av0)


With the tuned rpart model, accuracy has slightly improved: the sum of numbers on the main diagonal is larger.

On the green background, predicting on the reference positive class is less prolific in false negatives and, as a corollary, true positives are more predominant.

On the secondary diagonal, imbalance between false negatives and false positives is less marked, not only because there are less false negatives but also because there are more false positives. Nevertheless false negatives remain the weak point, being twice as numerous as false positives.

False negatives - and false positives - will be perused through text mining in the next section, looking for new insights towards accuracy improvement.


6 Text Mining

In this section, we are going to peruse the training reviews leading to false negatives or false positives with the CART model with cp tuning. This will be done with a view to pinpointing words, expressions, or phrases whose sentiment polarity could be used to better predict.

Another question will be raised: should topic-related words and tokens be maintained in the bag of words? Could they have any predictive impact?

Let’s first build an interactive table with all training reviews leading to false negatives or false positives with the CART model with cp tuning. Let’s start with false negatives, because there are more false negatives.


# To identify false negatives, we need both the actual review polarity and the predicted review polarity. Consequently, we are going to combine both variables in one data frame.  

df <- data.frame(sentiment = reviews_training$sentiment,
                 pred = fitted_cart_tuned_av0) 

# We have a false negative if actual review polarity is positive and if predicted review polarity is negative. If CART delivers a false negative for a specific row, then the next command below produces 1; if it is a false positive, the result is -1; a true positive or a true negative gives 0. So, 1 corresponds to what we are looking for, i.e. false negatives, -1 corresponds to false positives and 0 corresponds to either true positives or true negatives.  

FN_train <- ifelse(df$sentiment == " Pos", 1, 0) - 
            ifelse(df$pred == " Pos", 1, 0)

# Now, we have to generate a dichotomic vector with one specific value for false negatives or another specific value for all other cases (false positives, true positives or true negatives). That's exactly what the next command does. Indeed, if the command above gives 1 (false negative), then the command below delivers 1 as well while delivering 0 in all other cases (false positives, true positives or true negatives). 

FN_train <- ifelse(FN_train == 1, 1, 0)

# Row numbers corresponding to false negatives

FN <- which(FN_train == 1)

# Now let's build up an interactive table with all false negatives delivered by CART with cp tuning. 

# Let's create a receptacle data frame.

df_fn <- data.frame(row = FN,
                 review = as.character(1:length(FN)),
                 tokenized = as.character(1:length(FN))) %>%
  `colnames<-`(c("Row", 
                 "Training Reviews leading to False Negatives", 
                 "Tokenized"))

# In order to populate the receptacle data frame, let's build up a for loop garnering data, i.e. row number, training review and tokenized training review.

for (i in 1:length(FN)) {
  row <- FN[i]
  df_fn[i, 2] <- reviews_training$text[row]
  df_fn[i, 3] <- corpus_av0[[row]]$content
}

rm(i, row)

# Creating the interactive data table, using the DT package. 

datatable(df_fn, rownames = FALSE, filter = "top", 
          options = list(pageLength = 10, scrollX = T,
            initComplete = JS(
              "function(settings, json) {",
              "$(this.api().table().header()).css({
                  'background-color': '#a41034', 
                  'color': 'white'});", 
              "}"),
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )


In the interactive table above, if we scroll through false negatives several scenarios appear. Let’s classify false negatives into four scenarios, identified by the pivotal pieces of information that were not used by CART to produce the right polarity, i.e. the positive polarity:

  • subjective information unigrams unused;
  • subjective information multigrams unused;
  • negational unigrams unused;
  • negative short forms unused.


6.1 Subjective Information

When consulting the table of false negatives above, we can pinpoint some subjective information unigrams unused, i.e. words/tokens encompassing some subjective information that points to the right polarity.

These words, and the related standardized tokens, could be classified in several categories. Here, it has been opted for three main categories:

  • words (and tokens) expressing positive emotions, such as “impressed”, “joy”, and “glad”;
  • words (and tokens) expressing appreciation in a non-technical way, such as “fine”, “awesome”, and “rocks”;
  • words (and tokens) expressing technical qualities but outside of precise quantification, such as “fast”, “prompt”, and “sturdy”.

The first category is sentiment-related and so is the second category in most cases and to some degree. The third category relates to technicalities but without quantification. The three categories can be deemed as compliance-related, expressing to some degree compliance with expectations, requirements or advertisements.

To sum it up, the three categories will be referred to altogether in this project using phrases such as “subjective information” or “words conveying subjective information” or “tokens conveying subjective information”.

That subjective information is readily readable from a human point of view. But, in spite of these words/tokens, the polarity has been wrongly read by CART. Why?

Maybe because these words/tokens are not present in the final decision tree? Or maybe because other words/tokens have precedence in the decision tree?

Let’s have a look at the final decision tree delivered by CART with cp tuning.


tree <- prp(fit_cart_tuned_av0$finalModel, uniform = TRUE, cex = 0.8, 
    box.palette = c(super_light_gray, super_light_taupe)) 

# Keeping tree for further use. 


“Unfortunately”, the tokens pinpointed among the false negatives do not show in the decision trees. Let’s visualize the decision tree from the CART model with cp tuning.

What types of tokens can be seen in the decision tree?

There is a majority of tokens conveying subjective information (“great”, “comfort”, “love”, “like”, “disappoi”, etc.). They are usually the highest ranked.

There are also other types of tokens, but at a lower level: - intent-related tokens (“purchas”, “buy”) or - topic-related tokens (“plug”, “ear”).

Which is an interesting insight. In CART, tokens conveying subjective information predominate, which is not at all surprising! This points to solutions allocating higher priority to tokens conveying subjective information.

Although a majority of tokens are conveying subjective information in the decision tree, we do not find that many tokens with subjective information pinpointed among false negatives. It can be a matter of word (or token) frequency. This can be first checked up in the wordcloud that has already been visualized.

# Getting the bag of words without an irrelevant column.

df <- sentSparse_av0[, - ncol(sentSparse_av0)]

# Building up a vector with the 40 most frequent tokens in the bag of words.

temp <- data.frame(word = colnames(df), freq = colSums(df)) %>%
  filter(freq >= 10) %>%
  arrange(desc(freq)) %>%
  head(., 40)

# Creating an interactive wordcloud. 

set.seed(1)
wordcloud2(temp, shape = "square", color = "random-light",
           backgroundColor = greenish_blue, shuffle = FALSE)
rm(df)

# Let's notice that temp is not removed in order to further use it. 

For illustrative purposes, tokens can be visualized in decreasing order of frequency in the interactive histogram below.


# Preparing the histogram. 
graph <-  temp %>% mutate(word = reorder(word, freq)) %>%
  ggplot(aes(word, freq)) + 
  geom_bar(stat = "identity", width = 0.80, 
           color = "#007ba7", fill = "#007ba7") + 
  coord_flip() +
  ggtitle("Token Frequency") +
  xlab("Token") + ylab("Frequency") +
  theme(plot.title = element_text(hjust = 0.5, 
                                  size = 16, face = "bold"),
        axis.title.x = element_text(size = 16), 
        axis.title.y = element_text(size = 16), 
        axis.text.x = element_text(angle = 45, 
                                   hjust = 1, size = 12), 
        axis.text.y = element_text(size = 12))

# Making the graph interactive.
p <- ggplotly(graph, dynamicTicks = TRUE, 
              width = 500, height = 1000 )

# Centering the interactive graph.
htmltools::div(p, align = "center" )
rm(graph, p)

Among tokens depicted in the wordcloud and in the histogram, there are

  • topic-related tokens (“phone”, “batteri”, “headset”, “sound”, “ear”, etc.),
  • intent-related tokens (“purchas”, “buy”),
  • compliance-related tokens, expressing compliance or incompliance with expectations, requirements or advertisements (“fit”, “comfort”, “problem”, etc.),
  • sentiment-related tokens other than in previous category (“love”, “like”, etc.).

Most decision tree tokens appear in the wordcloud (or the histogram). The proportion of decision tree tokens appearing in the wordcloud or in the histogram.

# Collecting decision tree tokens in a character vector.

tree_tokens <- tree$obj$frame$var
tree_tokens <- tree_tokens[!tree_tokens == "<leaf>"]

# Collecting wordcloud tokens. They have already been stored in the data frame "temp", and in particular in the column whose name is "word".

wordcloud_tokens <- temp$word

# Extracting tree tokens that also apppear in the wordcloud and in the histogram.

intersection <- intersect(tree_tokens, wordcloud_tokens)

# Computing proportion of decision tree tokens appearing in the wordcloud or in the histogram.

prop <- length(intersection) * 100 / length(tree_tokens)
prop <- round(prop, 0)
prop <- paste(prop, "%", sep = " ")

# Building up a presentation data frame for the proportion.

tab <- data.frame(prop) %>%
  `colnames<-`("Proportion of Tree Tokens Appearing in the Wordcloud")

# Layout

knitr::kable(tab, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = "white", background = greenish_blue) 
Proportion of Tree Tokens Appearing in the Wordcloud
62 %
rm(tab)

So, there is some correlation between decision tree tokens and wordcloud – or histogram – tokens. Token frequency is the criterion for the wordcloud and matters for the decision tree. But token frequency is not enough to enter the decision tree: tokens need discriminant predictive power. So, “phone” is the wordcloud token with the highest frequency – 116 occurrences – but the decision tree is not comprised of phone; the reason of it seems obvious. On the contrary, great only has 69 occurrences and appears on top of the decision tree.

We can better visualize this when looking at some rpart output.


# class.output = "bg-primary" gives white color and blue background color.

tree$obj
## n= 668 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##        1) root 668 334  Pos (0.50000000 0.50000000)  
##          2) great>=0.5 68   4  Pos (0.94117647 0.05882353) *
##          3) great< 0.5 600 270 Neg (0.45000000 0.55000000)  
##            6) good>=0.5 45   7  Pos (0.84444444 0.15555556) *
##            7) good< 0.5 555 232 Neg (0.41801802 0.58198198)  
##             14) love>=0.5 16   0  Pos (1.00000000 0.00000000) *
##             15) love< 0.5 539 216 Neg (0.40074212 0.59925788)  
##               30) excel>=0.5 17   1  Pos (0.94117647 0.05882353) *
##               31) excel< 0.5 522 200 Neg (0.38314176 0.61685824)  
##                 62) nice>=0.5 12   0  Pos (1.00000000 0.00000000) *
##                 63) nice< 0.5 510 188 Neg (0.36862745 0.63137255)  
##                  126) best>=0.5 16   2  Pos (0.87500000 0.12500000) *
##                  127) best< 0.5 494 174 Neg (0.35222672 0.64777328)  
##                    254) comfort>=0.5 10   1  Pos (0.90000000 0.10000000) *
##                    255) comfort< 0.5 484 165 Neg (0.34090909 0.65909091)  
##                      510) well>=0.5 16   4  Pos (0.75000000 0.25000000) *
##                      511) well< 0.5 468 153 Neg (0.32692308 0.67307692)  
##                       1022) recommend>=0.5 17   5  Pos (0.70588235 0.29411765) *
##                       1023) recommend< 0.5 451 141 Neg (0.31263858 0.68736142)  
##                         2046) better>=0.5 14   4  Pos (0.71428571 0.28571429) *
##                         2047) better< 0.5 437 131 Neg (0.29977117 0.70022883)  
##                           4094) like>=0.5 18   7  Pos (0.61111111 0.38888889) *
##                           4095) like< 0.5 419 120 Neg (0.28639618 0.71360382)  
##                             8190) happi>=0.5 7   2  Pos (0.71428571 0.28571429) *
##                             8191) happi< 0.5 412 115 Neg (0.27912621 0.72087379)  
##                              16382) disappoint< 0.5 398 115 Neg (0.28894472 0.71105528)  
##                                32764) money< 0.5 385 115 Neg (0.29870130 0.70129870)  
##                                  65528) first< 0.5 374 115 Neg (0.30748663 0.69251337)  
##                                   131056) just>=0.5 10   4  Pos (0.60000000 0.40000000) *
##                                   131057) just< 0.5 364 109 Neg (0.29945055 0.70054945)  
##                                     262114) drop< 0.5 355 109 Neg (0.30704225 0.69295775)  
##                                       524228) bad< 0.5 347 109 Neg (0.31412104 0.68587896)  
##                                        1048456) poor< 0.5 339 109 Neg (0.32153392 0.67846608)  
##                                          2096912) terribl< 0.5 331 109 Neg (0.32930514 0.67069486)  
##                                            4193824) useless< 0.5 324 109 Neg (0.33641975 0.66358025)  
##                                              8387648) buy< 0.5 314 108 Neg (0.34394904 0.65605096)  
##                                               16775296) car>=0.5 10   4  Pos (0.60000000 0.40000000) *
##                                               16775297) car< 0.5 304 102 Neg (0.33552632 0.66447368)  
##                                                 33550594) qualiti>=0.5 8   3  Pos (0.62500000 0.37500000) *
##                                                 33550595) qualiti< 0.5 296  97 Neg (0.32770270 0.67229730) *
##                                              8387649) buy>=0.5 10   1 Neg (0.10000000 0.90000000) *
##                                            4193825) useless>=0.5 7   0 Neg (0.00000000 1.00000000) *
##                                          2096913) terribl>=0.5 8   0 Neg (0.00000000 1.00000000) *
##                                        1048457) poor>=0.5 8   0 Neg (0.00000000 1.00000000) *
##                                       524229) bad>=0.5 8   0 Neg (0.00000000 1.00000000) *
##                                     262115) drop>=0.5 9   0 Neg (0.00000000 1.00000000) *
##                                  65529) first>=0.5 11   0 Neg (0.00000000 1.00000000) *
##                                32765) money>=0.5 13   0 Neg (0.00000000 1.00000000) *
##                              16383) disappoint>=0.5 14   0 Neg (0.00000000 1.00000000) *


We can see the rationale of the decision tree. great arrives on top, with presence in 68 training reviews – we saw in the interactive wordcloud and histogram that the word (token) frequency was in fact 69 so there must be a review with twice the word (token) great. great is present in 64 reviews with positive sentiment polarity and only in 4 reviews with negative sentiment polarity.

The second token in the decision tree is good, present in 45 reviews, of which 38 reviews with positive sentiment polarity.

comfort is present in 10 training reviews, of which only 1 is negative. It comes before recommend with presence in 17 training reviews but 5 of them have negative sentiment polarity.

Below comfort we also find like with presence in 18 training reviews but 7 of them have negative sentiment polarity.

Now, it is time we went back to false negatives containing positive subjective information words (tokens) that has not been used to rightly predict positive sentiment polarity.

To visually check that positive subjective information words (tokens) can indeed flip sentiment polarity and help avoid some false negatives, let’s collect false negatives containing some of these words, e.g. 

  • glad,
  • impressed,
  • joy,
  • awesome,
  • fine,
  • rocks,
  • fast,
  • prompt,
  • and sturdy.
# Patterns we are looking for in false negatives

patterns <- c("glad", "impressed", "joy", "awesome", "fine", "rocks", "fast", "prompt", "sturdy")

# Collapsing all words, with the operator | between words.

patterns <- paste(patterns, collapse = "|")

# A data frame with the false negatives has already been created. It has been called df_fn. Column names are "Row", "Training Review", and "Tokenized". Let's onl keep rows with at least one of the words contained in "patterns". 

# Let's filter.

df <- df_fn %>%
  filter(str_detect(`Training Reviews leading to False Negatives`, 
                    patterns) == TRUE) %>%
  `colnames<-`(c("Row",
               "False Negatives with Positive Information",
               "Tokenized"))

# Creating the interactive data table, using the DT package. 

datatable(df, rownames = FALSE, filter = "top", 
          options = list(pageLength = 5, scrollX = T,
            initComplete = JS(
              "function(settings, json) {",
              "$(this.api().table().header()).css({
                  'background-color': '#a41034', 
                  'color': 'white'});", 
              "}"),
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )

That positive subjective information has not been used to predict positive polarity for the corresponding reviews.

We can think of two possible reasons: on the one hand, maybe these words were also present in numerous reviews with actual negative sentiment polarity; on the other hand, these words do not show up in the wordcloud, which means their occurrence frequency is at best not high and at worst very limited.

Acting on the first reason could be done by choosing a more performant algorithm than rpart, e.g. random forest. This will not be done at this stage because random forest can have a tendency to stick to data, even to outliers, on the training set and to be somewhat disappointing on the validation set (overfitting). This might camouflage problems.

It would be possible to act on the second reason: regrouping, in one way or another, the words (tokens) containing positive subjective information might be an avenue of research.

This looks like an interesting insight.

In conclusion, it might be impactful to garner subjective information conveyed by tokens such as “super”, “prettier”, “infatu”, “awesom”, etc. Since CART doesn’t do it, why not replace such tokens with a generic positive token? This would empower subjective information by building high frequency generic tokens only typified by sentiment orientation.

In this project, polarity of some tokens conveying positive subjective information will be inserted in additional files. That is one avenue of improvement that will be investigated in …

In a similar way, negative subjective information can also impact sentiment polarity. Let’s have a look at training reviews leading to false positives with the tuned rpart model.


# To identify false positives, we need both the actual review polarity and the predicted review polarity. Consequently, we are going to combine both variables in one data frame.  

df <- data.frame(sentiment = reviews_training$sentiment,
                 pred = fitted_cart_tuned_av0) 

# We have a false negative if actual review polarity is positive and if predicted review polarity is negative. If CART delivers a false negative for a specific row, then the next command below produces 1; if it is a false positive, the result is -1; a true positive or a true negative gives 0. So, 1 corresponds to what we are looking for, i.e. false negatives, -1 corresponds to false positives and 0 corresponds to either true positives or true negatives.  

FP_train <- ifelse(df$sentiment == "Neg", 1, 0) - 
            ifelse(df$pred == "Neg", 1, 0)

# Now, we have to generate a dichotomic vector with one specific value for false negatives or another specific value for all other cases (false positives, true positives or true negatives). That's exactly what the next command does. Indeed, if the command above gives 1 (false negative), then the command below delivers 1 as well while delivering 0 in all other cases (false positives, true positives or true negatives). 

FP_train <- ifelse(FP_train == 1, 1, 0)

# Row numbers corresponding to false negatives

FP <- which(FP_train == 1)

# Now let's build up an interactive table with all false negatives delivered by CART with cp tuning. 

# Let's create a receptacle data frame.

df_fp <- data.frame(row = FP,
                 review = as.character(1:length(FP)),
                 tokenized = as.character(1:length(FP))) %>%
  `colnames<-`(c("Row", 
                 "Training Reviews leading to False Positives", 
                 "Tokenized"))

# In order to populate the receptacle data frame, let's build up a for loop garnering data, i.e. row number, training review and tokenized training review.

for (i in 1:length(FP)) {
  row <- FP[i]
  df_fp[i, 2] <- reviews_training$text[row]
  df_fp[i, 3] <- corpus_av0[[row]]$content
}

rm(i, row)

# Creating the interactive data table, using the DT package. 

datatable(df_fp, rownames = FALSE, filter = "top", 
          options = list(pageLength = 5, scrollX = T,
            initComplete = JS(
              "function(settings, json) {",
              "$(this.api().table().header()).css({
                  'background-color': '#a41034', 
                  'color': 'white'});", 
              "}"),
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )

In parallel with false negatives, some training reviews leading to false positives contain subjective information that could flip sentiment polarity to positiveness. This time, it is of course negative subjective information. Here are a few examples: unusable, embarrassing, and unreliable.

Consequently some words and tokens with negative subjective information will be inserted into additional files.

But perusing false positives leads to another statement: sentiment polarity is often flipped by negation. This means that negation should be excluded from the stopwords that are discarded.

Let’s switch now to negation.


6.2 Negation


Another category of words (tokens) can also flip sentiment polarity: negational unigrams, or, simplier, negation, just as not or no. Among false negatives, we could notice some occurrences of negation that flipped sentiment polarity but that could obviously not be taken into account since these negational unigrams were considered as stopwords and had, for that reason, been discarded from tokens.

Let’s start with false positives because we expect more negation occurrences among false positives than among false negatives.


# Let's determine the patterns we are looking for in false negatives. Since most comments are capitalized, variants have been provided with capitalization. Around some negational unigrams, there is a leading empty space character and/or a trailing empty space character in order to avoid picking up longer unigrams containing some negational unigrams, e.g. "notice" instead of "no" or instead of "not". 

patterns <- c("neither", "Neither", "never", "Never", " no ", "No ", "none", "None", " nor ", " not ", "Not ", "nothing", "Nothing")

# Collapsing all words, with the operator | between words.

patterns <- paste(patterns, collapse = "|")

# A data frame with the false negatives has already been created. It has been called df_fn. Column names are "Row", "Training Review", and "Tokenized". Let's keep only rows with at least one of the words contained in "patterns". 

# Let's filter.

df <- df_fp %>%
  filter(str_detect(`Training Reviews leading to False Positives`, 
                    patterns) == TRUE) %>%
  `colnames<-`(c("Row", "False Positive Review with Negation",                      "Tokenized"))

# Creating the interactive data table, using the DT package. 

datatable(df, rownames = FALSE, filter = "top", 
          options = list(pageLength = 5, scrollX = T,
            initComplete = JS(
              "function(settings, json) {",
              "$(this.api().table().header()).css({
                  'background-color': '#a41034', 
                  'color': 'white'});", 
              "}"),
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )


# Let's determine the patterns we are looking for in false negatives. Since most comments are capitalized, variants have been provided with capitalization. Around some negational unigrams, there is a leading empty space character and/or a trailing empty space character in order to avoid picking up longer unigrams containing some negational unigrams, e.g. "notice" instead of "no" or instead of "not". 

patterns <- c("neither", "Neither", "never", "Never", " no ", "No ", "none", "None", " nor ", " not ", "Not ", "nothing", "Nothing")

# Collapsing all words, with the operator | between words.

patterns <- paste(patterns, collapse = "|")

# A data frame with the false negatives has already been created. It has been called df_fn. Column names are "Row", "Training Review", and "Tokenized". Let's keep only rows with at least one of the words contained in "patterns". 

# Let's filter.

df <- df_fn %>%
  filter(str_detect(`Training Reviews leading to False Negatives`, 
                    patterns) == TRUE) %>%
  `colnames<-`(c("Row", "False Negative Review with Negation",                      "Tokenized"))

# Creating the interactive data table, using the DT package. 

datatable(df, rownames = FALSE, filter = "top", 
          options = list(pageLength = 5, scrollX = T,
            initComplete = JS(
              "function(settings, json) {",
              "$(this.api().table().header()).css({
                  'background-color': '#a41034', 
                  'color': 'white'});", 
              "}"),
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )

The bigrams no problem" or no trouble* are clear from a human point of view but this bigram has become troubl, the negational token no having been removed with all other stopwords. Even if troubl is polarized under a generic negative token, as suggested above, the right polarity of no trouble wouldn’t show. Two avenues are opened up: the whole bigram no trouble could be converted into a generic token with positive orientation or, more generally, negational stopwords such as not or no could no longer be removed, which is another avenue for improvement.

Actually, the number of training reviews with not or no is limited among the false negatives. But it was much higher among false positives. Moreover, frequency could also be higher in the validation set, which we do not know. Keeping negational unigrams will be given a try in …

Up to now, negation has been shown in negational unigrams but negation can also be encapsulated into negative short forms (also called “contractions”) and they could impact both false negatives and false positives. Consequently, keeping negative short forms will also be given a try.


6.3 Negative Short Forms

Among false positives, there are many negative short forms that flip sentiment polarity to negativeness. There are also some among false negatives.

Consequently, negative short forms will also be removed from the stopwords that are discarded.


6.4 Multigrams


Sentiment can also be expressed through associations of words beyond negation cases already treated.

In some cases, these are rather stereotyped phrases. Let’s have a look at a few training reviews leading to false negatives but containing multigrams whose consideration could flip sentiment polarity prediction to the right status, i.e. positive sentiment polarity.


# Patterns we are looking for in false negatives

patterns <- c(" a bargain ", " a winner ", " any problem "," five star ", " must have ", " no problem "," thumbs up ", "Whoa", "whoa")

# Collapsing all words, with the operator | between words.

patterns <- paste(patterns, collapse = "|")

# A data frame with the false negatives has already been created. It has been called df_fn. Column names are "Row", "Training Review", and "Tokenized". Let's onl keep rows with at least one of the words contained in "patterns". 

# Let's filter.

df <- df_fn %>%
  filter(str_detect(`Training Reviews leading to False Negatives`, 
                    patterns) == TRUE) %>%
  `colnames<-`(c("Row",
               "False Negatives with Positive Information",
               "Tokenized"))

# Creating the interactive data table, using the DT package. 

datatable(df, rownames = FALSE, filter = "top", 
          options = list(pageLength = 5, scrollX = T,
            initComplete = JS(
              "function(settings, json) {",
              "$(this.api().table().header()).css({
                  'background-color': '#a41034', 
                  'color': 'white'});", 
              "}"),
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )


In the table above, we can see 7 examples of false negatives containing positive multigrams that can flip sentiment polarity to positiveness. These are only examples.


6.6 Context

Some reviews have delivered some more difficult cases. Here are a few examples among false positives.


# A data frame with the false negatives has already been created. It has been called df_fn. Column names are "Row", "Training Review", and "Tokenized". Let's keep only rows with at least one of the words contained in "patterns". Let's pick up 5 rows with figurative usage, slang, sarcasm or multifaceted wording. 

df <- df_fp %>% 
  filter(Row %in% c(54, 111, 113, 433, 586)) %>%
  `colnames<-`(c("Row", "False Positive Review with Negative Sentiment",                      "Tokenized"))

# Creating the interactive data table, using the DT package. 

datatable(df, rownames = FALSE, filter = "top", 
          options = list(pageLength = 5, scrollX = T,
            initComplete = JS(
              "function(settings, json) {",
              "$(this.api().table().header()).css({
                  'background-color': '#a41034', 
                  'color': 'white'});", 
              "}"),
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "#d6c0b0";','}',
              '}')
            )
          )


There can be figurative wording, sarcasm, irony, metaphors, multifaceted reviews, etc.

The table above gives five examples of more complex wording:

  • metaphor such as crawl referring to slowness;
  • slang such as crap in review 586;
  • sarcasm such as in review 113;
  • multifaceted review mentioning of strong and weak points such as in reviews 54 and 433.

Some metaphors and some slang can enter additional files. Of course, rpart a word like crawl used as a metaphor or slang such as crap. But among the training reviews, the occurrence frequency of these words is very low; knowing that the frequency among the training reviews selected at random is very limited, the conditional probability of having them in the validation set is also very limited; but including them into an additional file of subjective information (in these cases with negative sentiment polarity) can only be harmless if not useful.

Sarcasm is out of reach in this working paper.

Multifaceted wording might be better tackled by more sophisticated models below. A very simple trick will get a try: in case of multifaceted wording, the word but often indicates restriction and is often the dominant meaning: but will be removed drom stopwords and considered as a negative unigram.

Let’s check up that but is most of the time introducing impactful negative information. We’ll do that by collecting all training reviews containing the word but.


# Building up data frame.

tab <- reviews_training %>% 
  `colnames<-`(c("Row Number", "Training Review", "Sentiment")) %>%
  filter(str_detect(`Training Review`, "but ") == TRUE)

# Building up interactive presentation table.

datatable(tab, rownames = FALSE, filter = "top", 
          options = list(pageLength = 5, scrollX = T,
                         
          # Setting background color and font color in header.               
                         
            initComplete = JS(
              'function(settings, json) {',
              '$(this.api().table().header()).css({
                  "background-color": "#507786", 
                  "color": "white"});', 
              '}'),
            
            # Setting background color in rows. 
            
            rowCallback = JS(
              'function(row, data, index, rowId) {',
              'console.log(rowId)',
              'if(rowId >= 0) {',
                   'row.style.backgroundColor = "powderblue";','}',
              '}')
            )
          )
rm(tab)

Using the interactive table above, we can easily notice that 19 reviews with but in them lead to negative sentiment polarity. Consequently, we are going to remove but from the stopwords and let it stand on its own as a token in the bag of words.


6.7 Conclusion


In text mining, insights have been obtained

  • by comparing token frequency in the bag of words (wordcloud and histogram) and in token lists from CART and Random Forest
  • and by perusing false negatives and positives in CART and Random Forest.

Among insights, let’s mention:

  • topic-related tokens predominate in the bag of words;
  • topic-related tokens show in limited number and at a lower level in the CART decision tree and in the predictor importance list from Random Forest;
  • subjective information tokens predominate in the CART decision list and in the predictor importance list from Random Forest;
  • many subjective information tokens show in false negatives or false positives but neither in the CART decision tree nor in the predictor importance list from Random Forest (first 20 positions);
  • many negational n-grams are present in reviews giving false negatives or positives and often reverse sentiment polarity of the reviews but they cannot be made actionable in machine learning since they have been removed from the bag of words as stopwords;
  • these negational n-grams can be negational unigrams such as “not” or negative short forms such as “isn’t”.

In the next section, these text mining insights will be tentatively transposed into NLP and machine learning actions towards more accuracy.

Three avenues of improvement have been opened up:

  • integrating negational unigrams (not, etc.);
  • integrating negative short forms (isn’t, etc.);
  • establishing polarized lists of subjective information tokens and replacing instances of these tokens in reviews with one generic token, either positive or negative.

Stepwise, the three avenues will be quantitatively tested.

The whole research has been performed only on training reviews without any kind of intermixture with validation reviews.


7 Predicting with Text Mining

7.1 Negation


Negational unigrams have been introduced, NLP has be rerun as well as the CART model with tuning, which is used as a performance yardstick. Here are the results.


# Building up new corpus.

corpus_av1_a <- VCorpus(VectorSource(reviews_training$text)) 
corpus_av1_a <- tm_map(corpus_av1_a, content_transformer(tolower))

# Replacing all punctuation marks with white space characters, instead of just removing punctuation marks, to prevent tokens like "brokeni" from being generated. Keeping apostrophes to leave intact short forms such as "don't" so that they can be removed as stopwords.  

for (i in 1:nrow(reviews_training)) {
  corpus_av1_a[[i]]$content <- gsub("(?!')[[:punct:]]", " ", 
                                    corpus_av1_a[[i]]$content, perl = TRUE)
}
rm(i)

# Removing short forms after regulating white space characters.

corpus_av1_a <- tm_map(corpus_av1_a, stripWhitespace)
corpus_av1_a <- tm_map(corpus_av1_a, removeWords, short_forms_neg)
corpus_av1_a <- tm_map(corpus_av1_a, removeWords, short_forms_pos)

# Removing remaining apostrophes (there can be apostrophes outside of short forms). 

for (i in 1:nrow(reviews_training)) {
  corpus_av1_a[[i]]$content <- gsub("[[:punct:]]", " ", 
                                    corpus_av1_a[[i]]$content)
}
rm(i)

# Removing stopwords_remaining, stemming, removing numbers, digits and multiple white space characters (leaving only one white space character at a time).

corpus_av1_a <- tm_map(corpus_av1_a, removeWords, stopwords_remaining)
corpus_av1_a <- tm_map(corpus_av1_a, stemDocument)
corpus_av1_a <- tm_map(corpus_av1_a, removeNumbers)
corpus_av1_a <- tm_map(corpus_av1_a, stripWhitespace)

# Building bag of words, managing sparsity threshold, converting to data frame, regularizing column names and adding dependent variable.

dtm_av1_a <- DocumentTermMatrix(corpus_av1_a)
sparse_av1_a <- removeSparseTerms(dtm_av1_a, 0.995)
sentSparse_av1_a <- as.data.frame(as.matrix(sparse_av1_a)) 
colnames(sentSparse_av1_a) <- make.names(colnames(sentSparse_av1_a))
sentSparse_av1_a <- sentSparse_av1_a %>% 
  mutate(sentiment = reviews_training$sentiment)

# Training CART with the algorithm rpart with cp tuning.

set.seed(1)
fit_cart_tuned_av1_a <- train(sentiment ~ .,
                              method = "rpart",
                              data = sentSparse_av1_a,
                              tuneLength = 15,
                              metric = "Accuracy")
fitted_cart_tuned_av1_a <- predict(fit_cart_tuned_av1_a)
cm_cart_tuned_av1_a <- confusionMatrix(as.factor(fitted_cart_tuned_av1_a), 
                      as.factor(sentSparse_av1_a$sentiment))

# Table comprised of accuracy. 

tab <- data.frame(cm_cart_tuned_av1_a$overall["Accuracy"]) %>%
       `rownames<-`("Model: Neg Short Forms + CART + Tuning") %>%
       `colnames<-`("Accuracy on the Training Set")

# Layout

knitr::kable(tab, "html", align = "c") %>% 
       kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
       column_spec(1, bold = T, color = "#808080") %>%
       column_spec(2, bold = T, color = "white", background = greenish_blue)
Accuracy on the Training Set
Model: Neg Short Forms + CART + Tuning 0.7949102
rm(tab)


There is accuracy improvement approximately from 78 % to 79 %. Consequently, negational unigrams such as “not” will be kept in the corpus.

For the record, does not show in the decision tree?


prp(fit_cart_tuned_av1_a$finalModel, uniform = TRUE, cex = 0.8, 
    box.palette = c(super_light_gray, super_light_taupe))


Yes, indeed it does and rather predominantly!


7.2 B. Negative Short Forms

7.2.1 1. Adding Negative Short Forms


First, negative short forms will no longer be removed from the corpus and will thus enter the bag of words. Impact on accuracy will be tested.


# Building up new corpus.

corpus_av1_b <- VCorpus(VectorSource(reviews_training$text)) 
corpus_av1_b <- tm_map(corpus_av1_b, content_transformer(tolower))

# Replacing all punctuation marks with white space characters, instead of just removing punctuation marks, to prevent tokens like "brokeni" from being generated. Keeping apostrophes to leave intact positive short forms such as "it's" so that they can be removed. 

for (i in 1:nrow(reviews_training)) {
  corpus_av1_b[[i]]$content <- gsub("(?!')[[:punct:]]", " ", 
                                    corpus_av1_b[[i]]$content, perl = TRUE)
}
rm(i)

# Removing only positive short forms after reducing to one the number of white space characters in a row in a row.

corpus_av1_b <- tm_map(corpus_av1_b, stripWhitespace)
corpus_av1_b <- tm_map(corpus_av1_b, removeWords, short_forms_pos)

# Removing remaining apostrophes. 

for (i in 1:nrow(reviews_training)) {
  corpus_av1_b[[i]]$content <- gsub("[[:punct:]]", " ", 
                                    corpus_av1_b[[i]]$content)
}
rm(i)

# Removing stopwords_remaining, stemming, removing numbers, digits and multiple white space characters (leaving only one white space character at a time).

corpus_av1_b <- tm_map(corpus_av1_b, removeWords, stopwords_remaining)
corpus_av1_b <- tm_map(corpus_av1_b, stemDocument)
corpus_av1_b <- tm_map(corpus_av1_b, removeNumbers)
corpus_av1_b <- tm_map(corpus_av1_b, stripWhitespace)

# Building bag of words, managing sparsity threshold, converting to data frame, regularizing column names and adding dependent variable.

dtm_av1_b <- DocumentTermMatrix(corpus_av1_b)
sparse_av1_b <- removeSparseTerms(dtm_av1_b, 0.995)
sentSparse_av1_b <- as.data.frame(as.matrix(sparse_av1_b)) 
colnames(sentSparse_av1_b) <- make.names(colnames(sentSparse_av1_b))
sentSparse_av1_b <- sentSparse_av1_b %>% 
  mutate(sentiment = reviews_training$sentiment)

# Training CART with the algorithm rpart with cp tuning.

set.seed(1)
fit_cart_tuned_av1_b <- train(sentiment ~ .,
                              method = "rpart",
                              data = sentSparse_av1_b,
                              tuneLength = 15,
                              metric = "Accuracy")
fitted_cart_tuned_av1_b <- predict(fit_cart_tuned_av1_b)
cm_cart_tuned_av1_b <- confusionMatrix(as.factor(fitted_cart_tuned_av1_b), 
                                       as.factor(sentSparse_av1_b$sentiment))

# Table comprised of accuracy

tab <- data.frame(cm_cart_tuned_av1_b$overall["Accuracy"]) %>%
       `rownames<-`(
          "Model: Negation + Neg Short Forms + CART + Tuning") %>%
       `colnames<-`("Accuracy on the Training Set")

# Layout

knitr::kable(tab, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = "#808080") %>%
  column_spec(2, bold = T, color = "white", background = greenish_blue)
Accuracy on the Training Set
Model: Negation + Neg Short Forms + CART + Tuning 0.7949102
rm(tab)


Adding negative short forms such as “isn’t” does not impact accuracy level. Consequently, another try will be done with negative short forms: instead of being added, negative short forms will be replaced with “not”.


7.2.2 2. Replacing Negative Short Forms with " not "


Positive short forms will still be removed from the corpus with the function removeWords() from the package tm, which removes separate words and no substrings.

Negative short forms will be searched for with the function gsub(). To avoid picking up substrings as well, one white space character will be added in front of all negative short forms and at the end of each of them. Indeed, let’s not forget that misspelled short forms such as “dont” have been deliberately introduced as well among negative short forms to take into account “alternative grammar”, which is omnipresent in reviews. And words such as “dont” can be substrings out of other words. The gsub() function will replace the “escorted” short forms with " not ".

Let’s have a look at the new accuracy level.


# Building up new corpus.

corpus_av1_c <- VCorpus(VectorSource(reviews_training$text)) 
corpus_av1_c <- tm_map(corpus_av1_c, content_transformer(tolower))

# Replacing all punctuation marks with white space characters, instead of just removing punctuation marks, to prevent tokens like "brokeni" from being generated. Keeping apostrophes to leave intact short forms such as "it's" so that positive short forms can be removed. 

for (i in 1:nrow(reviews_training)) {
  corpus_av1_c[[i]]$content <- gsub("(?!')[[:punct:]]", " ", 
                                    corpus_av1_c[[i]]$content, perl = TRUE)
}
rm(i)

# Adding one white space character at the beginning and at the end of each negative short form in order to prepare to use the function gsub() without picking up substrings. 

dummy <- paste("", short_forms_neg, "")

# Replacing negative short forms with " not ".

for (i in 1:nrow(reviews_training)) {
  for (j in 1:length(short_forms_neg)) {
    corpus_av1_c[[i]]$content <- gsub(dummy[j], " not ", 
                                      corpus_av1_c[[i]]$content)
  }
}
rm(dummy)

# Removing only positive short forms after reducing to one the number of white space characters in a row in a row.

corpus_av1_c <- tm_map(corpus_av1_c, stripWhitespace)
corpus_av1_c <- tm_map(corpus_av1_c, removeWords, short_forms_pos)

# Removing remaining apostrophes. 

for (i in 1:nrow(reviews_training)) {
  corpus_av1_c[[i]]$content <- gsub("[[:punct:]]", " ", 
                                    corpus_av1_c[[i]]$content)
}
rm(i)

# Removing stopwords_remaining, stemming, removing numbers, digits and multiple white space characters (leaving only one white space character at a time).

corpus_av1_c <- tm_map(corpus_av1_c, removeWords, stopwords_remaining)
corpus_av1_c <- tm_map(corpus_av1_c, stemDocument)
corpus_av1_c <- tm_map(corpus_av1_c, removeNumbers)
corpus_av1_c <- tm_map(corpus_av1_c, stripWhitespace)

# Building bag of words, managing sparsity threshold, converting to data frame, regularizing column names and adding dependent variable.

dtm_av1_c <- DocumentTermMatrix(corpus_av1_c)
sparse_av1_c <- removeSparseTerms(dtm_av1_c, 0.995)
sentSparse_av1_c <- as.data.frame(as.matrix(sparse_av1_c)) 
colnames(sentSparse_av1_c) <- make.names(colnames(sentSparse_av1_c))
sentSparse_av1_c <- sentSparse_av1_c %>% 
  mutate(sentiment = reviews_training$sentiment)

# Training CART with the algorithm rpart with cp tuning.

set.seed(1)
fit_cart_tuned_av1_c <- train(sentiment ~ .,
                              method = "rpart",
                              data = sentSparse_av1_c,
                              tuneLength = 15,
                              metric = "Accuracy")
fitted_cart_tuned_av1_c <- predict(fit_cart_tuned_av1_c)
cm_cart_tuned_av1_c <- confusionMatrix(as.factor(fitted_cart_tuned_av1_c), 
                                       as.factor(sentSparse_av1_c$sentiment))

# Table comprised of accuracy

tab <- data.frame(cm_cart_tuned_av1_c$overall["Accuracy"]) %>%
  `rownames<-`('Model: Negation + [Neg Short Forms = "not"] + CART + Tuning') %>%
  `colnames<-`("ACCURACY ON THE TRAINING SET")

# Layout

knitr::kable(tab, "html", align = "c") %>% 
  kable_styling(bootstrap_options = "bordered", 
                full_width = F, font_size = 16) %>%
  column_spec(1, bold = T, color = "#808080") %>%
  column_spec(2, bold = T, color = "white", background = greenish_blue)
ACCURACY ON THE TRAINING SET
Model: Negation + [Neg Short Forms = “not”] + CART + Tuning 0.7889222
rm(tab)


Replacing negative short forms with " not " downgrades accuracy. This path will not be followed.


7.3 C. Polarization


In samples of false negatives and false positives, analysis has pinpointed unigrams and multigrams that convey subjective information.

In the line of these insights, these n-grams have been listed and classified as positive and negative. They have been inserted into four files and the files have been uploaded in the GitHub repository https://github.com/Dev-P-L/Sentiment-Analysis :

  • subj_pos_multigrams.csv,
  • subj_pos_unigrams.csv,
  • subj_neg_multigrams.csv,
  • subj_neg_unigrams.csv.

Here are a few examples from each file of polarized n-grams.

Positive sentiment oriented unigrams from subj_pos_unigrams.csv (stemmed): “super”, “awesom”, etc.

Some positive multigrams from the file sub_pos_multigrams.csv (not stemmed): “no trouble”, “5 stars”, “thumbs up”, “it’s a ten”, “as described”, “know what they’re doing”, “must have”. Possible variants have usually been added, including variants originating from spelling errors or “alternative grammar”: “no troubles”, “not any trouble”, “not any troubles”, “no problem”, “no problems”, etc.; “five stars”, “five star”, “5-star”, “5star”, “5 star”; “it’s a 10”, “it’s a ten”, “its a 10”, etc.; “know what theyre doing”, “know what they are doing”, etc.

Some negative unigrams (after stemming) from the file subj_neg_unigrams.csv: “horribl”, “crap”, “whine”, etc.

Some negative multigrams (not stemmed) from the file sub_neg_multigrams.csv: “1 star”, “one star”, “not good”, “no good”, “shouldn’t” (often associated with negative context), “pretty piece of junk”, etc.

In the training reviews, instances of the positive n-grams will be replaced with " subjpo " and instances of negative n-grams with " subjneg ".

Efficacy-minded rules will be applied in this NLP process.

First, the polarized n-grams will be preceded and followed by one white space character when looking for instances in reviews. Otherwise, in the bag of words, the n-gram “most inconvi” would become “most in subjpo” (because “convi” is a polarized unigram in subj_pos_unigrams.csv) and then " subjpo " (because “most” and “in” are stopwords in stopwords_remaining.csv)! A negatively oriented multigram would become a positively oriented unigram! Consequently, one white space character is added in front of and at the end of each polarized n-gram before looking for matching instances in NLP-transformed reviews, in order to avoid replacing substrings.

Second, as a consequence, a white space character has to be added at the beginning and at the end of each NLP-transformed review! Otherwise, polarized n-grams, which are preceded and followed by one white space character can never match an instance that is positioned at the beginning or at the end of a review.

Third, " subjpo " and " subjneg " contain one white space character at the beginning and at the end, in order to prevent amalgamation. Indeed, what would happen if white space characters were not added? Let’s take our well known example of " conveni “: if it were replaced with just”subjpo" in the n-gram " most conveni “, then it would produce” mostsubjpo", which would no longer be a generic positive unigram! Transformation would be useless if not annoyingly counterproductive!

Fourth, multiple inter word white space characters have to be reduced to a single inter word white space character: indeed, listed multigrams only have one white space character between words and could never match multigrams from reviews with several white space characters between words.

Fifth, in training reviews, negative multigrams have got to be replaced before positive multigrams. Let’s take the example of " not a good bargain “, which is a negatively polarized multigram from the file subj_neg_multigrams.csv: if matching with instances in reviews begins with positively polarized n-grams, then” not a good bargain " in a review becomes " not a subjpo “, which might be less clear in machine learning than” subjneg “! For similar reasons, positive multigrams are matched before negative unigrams and positive unigrams. Sixth, negative or positive polarized multigrams should be tentatively matched in decreasing order in for loops. Why? Let’s take the example of” no good bargain " in one review. In sub_neg_multigrams.csv, there are two negatively polarized multigrams: " no good bargain " and " no good “; if these are considered in decreasing order, then, in the review,” no good bargain " is replaced with " subjneg “, which looks appropriate; otherwise” no good bargain " is replaced with " subjneg bargain " and then " subjneg subjpo ": consequently, instead of having one negative generic unigram we would get one positive and one negative generic unigrams!

NLP will be rerun again. In each training review, all n-grams that match positive n-grams from subj_pos_multigrams.csv or subj_pos_unigrams.csv will be replaced with a generic positive token (" subjpo “); all n-grams that match negative n-grams from subj_neg_multigrams.csv or subj_neg_unigrams.csv will be replaced with a generic negative token (” subjneg ").

The utf8 package will be used to normalize punctuation: there has been some trouble with curly apostrophes instead of straight apostrophes.